# Parallelism (PAR)

Unit 3: Introduction to parallel architectures (or in other words, where the data sharing overheads come from?)

Eduard Ayguadé, Daniel Jiménez and Gladys Utrera

Computer Architecture Department Universitat Politècnica de Catalunya

Course 2021/22 (Spring semester)

### Learning material for this lesson

- ► Atenea: Unit 3.1 Introduction to parallel architectures I
  - ► Video lesson 4: UMA architectures
  - Quizzes after different parts in video lesson 4
- Atenea: Unit 3.2 Introduction to parallel architectures II
  - Video lesson 5: NUMA architectures
  - Quizzes after different parts in video lesson 5
- These slides to dive deeper into UMA and NUMA architectures
- Collection of Exercises: problems in Chapter 3



#### Uniprocessor parallelism

Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

#### Multicore architectures

Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Pipelined and superscalar architecture

- Execution of single instruction divided in multiple stages
- Overlap the execution of different stages of consecutive instructions (pipelined) and consecutive instructions (superscalar)



#### SIMD architecture

▶ DLP (data-level parallelism): Single-Instruction executed on Multiple-Data (SIMD): vector functional unit



### Memory hierarchy

 Addressing the yearly increasing gap between CPU cycle and memory access times



Size vs. access time



► Non-blocking design



Multicores

### Memory hierarchy

- ▶ The principle of locality: if an item is referenced ...
  - Temporal locality: ... it will tend to be referenced again soon (e.g., loops, reuse)
  - ► Spatial locality: ... items whose addresses are close by tend to be referenced soon (e.g., straight line code, array access)
- Line (or block)
  - A number of consecutive words in memory (e.g. 32 bytes, equivalent to 4 words x 8 bytes)
  - Unit of information that is transferred between two levels in the hierarchy
- On an access to a level in the hierarchy
  - Hit: data appears in one of the lines in that level
  - Miss: data needs to be retrieved from a line in the next level

### Elements of cache design

- Organization
  - ► Single vs. multilevel cache, Unified vs. split (instruction/data)
  - Cache size and line size
  - Addressing: logical vs. physical
- ► Placement algorithm
  - Direct, associative, set associative
- Replacement algorithm
  - Random, Least Recently Used (LRU), First-in First-out (FIFO), Least Frequently Used (LFU)
- Write (on hit) Policy
  - Write-through, Write-back
- Write (on miss) Policy
  - ► Write-allocate, write-no-allocate



# Who exploits this uniprocessor parallelism and memory organization?

In theory, the compiler understands all of this ... but in practice the compiler may need your help, for example:

- Software pipelining to statically schedule ILP
- Vectorization to efficiently exploit SIMD vector units
- Data contiguous in memory and aligned to cache lines
- Blocking (or tiling) to define a problem that fits in register/L1-cache/L2-cache (temporal locality)

Reasons and techniques explored in detail in PCA course (Architecture-Conscious Programming)

Uniprocessor parallelism

Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Classification of multi-processor architectures

| Memory<br>architecture                         | Address<br>space(s)                                           | Connection                                     | Model for data sharing                                 | Names                                                                                                  |
|------------------------------------------------|---------------------------------------------------------------|------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| (Centralized)<br>Shared-memory<br>architecture | Single shared<br>address space,<br>uniform access<br>time     | Processor Processor Main memory                | Load/store<br>instructions from<br>processors          | SMP (Symmetric Multi-<br>Processor) architecture     UMA (Uniform Memory<br>Access) architecture       |
| Distributed-<br>memory<br>architecture         | Single shared<br>address space,<br>non-uniform<br>access time | Processor  Adain memory  Main memory           | Load/store<br>instructions from<br>processors          | DSM (Distributed-Shared<br>Memory architecture     NUMA (Non-Uniform<br>Memory Access)<br>architecture |
|                                                | Multiple<br>separate<br>address spaces                        | Processor  Processor  Main memory  Main memory | Explicit messages<br>through network<br>interface card | Message-passing<br>multiprocessor     Cluster Architecture     Multicomputer                           |

#### Uniprocessor parallelism

#### Symmetric multi-processor architectures

#### Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

#### Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Concepts in video lesson 4

- Centralised shared-memory architectures, also called UMA (Uniform Memory Access time) or SMP (Symmetric) multiprocessors
- Cache coherence problem in centralised shared-memory architectures
  - Programmer vs. hardware views
  - Two snooping-based solutions to cache incoherence: write-update vs. write invalidate



### Concepts in video lesson 4 (cont.)

#### The coherence problem:



Chart shows value of **foo** (variable stored at address X) stored in main memory and in each processor's cache \*\*\*

| Action                             | P1 \$ | P2 \$ | P3 \$ | P4\$ | mem[X] |
|------------------------------------|-------|-------|-------|------|--------|
|                                    |       |       |       |      | 0      |
| P1 load X                          | 0 mis | s     |       |      | 0      |
| P2 load X                          | 0     | 0 mi  | ss    |      | 0      |
| P1 store X                         | 1     | 0     |       |      | 0      |
| P3 load X                          | 1     | 0     | Ø mi  | ss   | 0      |
| P3 store X                         | 1     | 0     | 2     |      | 0      |
| P2 load X                          | 1     | 0 hi  | t 2   |      | 0      |
| P1 load Y<br>(say this load causes |       | 9     | 2     |      | 1      |

eviction of foo)

<sup>\*\*</sup> Assumes write-back cache behavior

### Concepts in video lesson 4 (cont.)

#### Coherence protocols:

- Write-update: writing processor broadcasts the line with the new value and forces all others to update their copies
- Write-invalidate: writing processor forces all others to invalidate their copies; the line with the new value is provided to others when requested or when flushed from cache

#### Coherence mechanisms:

- Broadcast-based (snooping): bus serves as broadcast mechanism to maintain coherency among copies of the same memory line in caches
- Directory-based: the sharing status of each line in memory is kept centralised in just one location (directory)

Uniprocessor parallelism

#### Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Broadcast-based (snooping) coherence mechanism

- Cache coherence is maintained at cache line granularity, NOT at the individual words inside the cache line
- Every cache that has a copy of a line from physical memory keeps its sharing status (status distributed)
- Broadcast medium (e.g. a bus) used to make all transactions visible to all caches and define ordering
- Caches monitor (snoop on) the medium and take action on relevant events (SCC: snoopy cache controllers)



### Dual-ported caches to support coherence (optional)

Listen to commands both from processor and from broadcast medium (e.g. bus)



### Simple write-invalidate snooping protocol (MSI)

- ▶ A line in a cache memory can be in three different states:
  - ► Modified (M): dirty copy of the line
  - ► Shared (S): clean copy of the line
  - Invalid (I): invalidated copy of the line (not valid), or it does not exist in cache
- CPU events
  - PrRd (Processor read)
  - PrWr (Processor write)
- Bus events (caused by cache controllers)
  - BusRd: asks for copy with no intent to modify
  - BusRdX: asks for copy with intent to modify
  - BusUpgr: asks for permission to modify existing line, causes invalidation of other copies
  - ► Flush: puts line on bus, either because requested or voluntarily when dirty line in cache is replaced (WriteBack)



Multicores

### Simple write-invalidate snooping protocol (MSI)



- ▶ Who provides the line when requested via BusRd or BusRdX?
  - ▶ If line in S or I in other caches then main memory provides it
  - ▶ If line in M in another cache then this cache provides it (Flush)



### MSI optimizations. Thread-private lines

- MSI requires two bus transactions for the common case of read followed by write, both from the same processor (no sharing at all)
  - ► Transaction 1: BusRd to move from I to S state
  - Transaction 2: BusUpgr to move from S to M state
- ► **MESI** protocol adds E (Exclusive) clean state:
  - ► Cache line in E if only one clean copy of the line
  - If write access by the same processor, the upgrade from E to M does not require a bus transaction (BusUpgr)
  - ▶ If line in E and another cache requests it then cache line state changes from E to S

### Optional: MSI optimizations. Cache-to-cache transfers

- Does main memory need to be updated when flushing? MOSI protocol adds O (Owned) state:
  - When flushing, state in cache for the line transitions from M to O (dirty since main memory is not updated)
  - Cache with line in O state is responsible for providing data when requested (not main memory)
  - ▶ Other caches maintain shared line in S state
  - Main memory updated when line in O is replaced from cache



### Optional: MSI optimizations. Cache—to—cache transfers

- ▶ Does main memory need to supply data if already shared in another cache? MSIF protocol adds F (Forward) state:
  - Which cache should provide the line if several copies?
  - Cache with line in F state is responsible for providing data when requested (not main memory)
  - ▶ Last cache asking for line transitions to F state (temporal locality), others transition/keep it S
- Combined use possible (MESIF/MOESI/MOESIF)

Uniprocessor parallelism

#### Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

#### True vs. false sharing

- True sharing
  - Data sharing is unavoidable in parallel computing. Coherence mechanisms are there to allow this data sharing; synchronization allows to share appropriately.

Multicores

- False sharing
  - Cache line may also introduce artefacts: more that 1 (distinct) data object, or also multiple elements of same object, may reside in the same cache line
  - False sharing occurs when different processors make references (read and write) to those different objects or elements within the same cache line, thereby inducing "unnecessary" coherence operations.

Multicores

### True sharing example

Assume each task is executing an instance of the following dot\_product function:

The line containing variable result is subject to coherence actions at each iteration of i. It could be easily transformed into

```
void dot_product(int *A, int *B, int n) {
  int tmp = 0; i< n; i++)
    tmp += A[i] * B[i];
  #pragma omp atomic
  result += tmp;
}</pre>
```

to reduce cache coherence traffic. **Note:** atomic is used to guarantee exclusive access to variable result

### False sharing example

```
struct foo {
  int x, y; //x and y will reside in same cache line
} f; // aligned to cache line

void main() {
  int s=0;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task shared(s)
    for (int i=0;i<1000000;i++)
        s+=f.x
    #pragma omp task
    for (int i=0;i<1000000;i++)
        f.y++
}
}</pre>
```

```
How data is stored in memory?
cache line

f.x f.y

Assumptions:
- Variable f aligned to cache line
- Cache line 16 bytes wide
- int occupies 4 bytes
```

False sharing of the line containing fields x and y. How could we force fields x and y to be in different cache lines?

Multicores

## False sharing example (cont.)

```
struct foo {
  int x;
  int padding[3];
  int y; //x and y will NOT reside in same cache line
} f; // aligned to cache line

void main() {
  int s=0;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task shared(s)
    for (int i=0;i<1000000;i++)
        s+=f.x
    #pragma omp task
    for (int i=0;i<1000000;i++)
    f.y++
}</pre>
```

```
How data is stored in memory?

cache line

f.x f.y

Assumptions:

- Variable 1 aligned to cache line
- Cache line 16 bytes wide
- int occupies 4 bytes

Padding to avoid false sharing
cache line

cache line

f.x paddIng[3] f.y
```

Add a dummy field in between to separate fields x and y in different cache lines: int padding[PAD\_SIZE]; How much padding?



Uniprocessor parallelism

#### Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

#### Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

Multicores

### Transistors, frequency, power, performance and ... cores!

An inflexion point in 2004 ... the power wall



Original data up to the year 2010 collected and proteed by M. Horowitz, F. Laborite, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2017 by K. Rupp



- ▶ The increasing number of transistors on a chip is used to accommodate multiple processors (cores) on a single chip
- Usually private caches (up to a certain cache level) and one last-level cache (LLC)
- Coherence maintained at the LLC level
- Chip or socket boundary, access to main memory
- Multicore = Chip Multi-Processor (CMP)



Multicores 00000

### Example: multicore socket based on Intel Nehalem i7



### Example: scalable multi-socket systems

Each socket is a multicore processor with a number of cores inside (e.g. SKL up to 24) and connected to memory (DDR DIMMs)



UPI/QPI ports to interconnect sockets and provide cache-coherent shared memory (but not uniform access time anymore!)

Uniprocessor parallelism

Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

Uniprocessor parallelism

#### Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

#### Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Concepts in video lesson 5

- NUMA architecture
  - Main memory distributed across multiple nodes
  - Non-uniform memory access
- Directory to keep track of the status of memory lines
  - Directory divided into "slices", one slice per node
  - Each slice serves the memory lines in the node, one entry per memory line
- Directory entry: clean/dirty line. list of sharer nodes



Uniprocessor parallelism

Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

### Scaling of the broadcast mechanism

- Snooping schemes broadcast coherence messages to determine the state of a line in the other caches
  - Processor initiating access sends command to ALL other processors (having or not copy of the line)
  - Could be extended to support coherence in small NUMA systems, but does not scale to large number of nodes (excessive coherence traffic)
- Alternative: avoid broadcast by storing information about the status of each line in main memory, in the so called directory divided in slices, one slice per node
  - ► Each slice of the directory tracks the location of copies in caches of its memory lines
  - Coherence is maintained by point-to-point messages between the nodes



#### MSU directory-based cache coherency

- One slice of the directory associated to each node memory: one entry per line of memory
  - ▶ **Status bits**: they track the state of cache lines in its memory
  - Sharers list: tracks the list of remote nodes having a copy of a line. For small-scale systems, implemented as a bit string



Sharers list: nodes currently having the line

- Bit string
- 1 bit per node, position indicates node
- If 64 byte block size: 12.5% overhead (64 nodes), 50% (256 nodes), 200% (1024 nodes)
- ▶ Directory slice is the "centralised" structure that "orders" the accesses to the lines in the associated node



### Directory-based cache coherency (cont.)

- Who is involved in maintaining coherence of a memory line?
  - ▶ **Home** node: node where the line is allocated. It has the directory slice with the information to maintain its coherence.
  - ▶ Local node: node with the processor accessing the line
  - Remote nodes: Owner node containing dirty copy or Reader nodes containing clean copies of the line
- But ... how the **home** node for a memory line is decided?
  - **OS managed**, for example using a policy named *first touch* 
    - ► The node that first "touches" a page will be the home node for all the lines in that page
    - For example, if memory pages are P=4 KBytes and memory lines are L=128 Bytes, then a page will contain P/L=32 consecutive memory lines
    - Unless indicated differently we will assume that the number of memory lines in a page is 1 (i.e. P = L)



### Simplified coherency protocol

Possible commands arriving to home node from local node:

- ▶ RdReq: asks for copy of line with no intent to modify
- WrReq: asks for copy of line with intent to modify
- UpgrReq: asks for permission to modify an existing line, invalidating all other copies

As a result of **RdReq** and **WrReq** the home node sends clean copy of line (**Dreply** command to local node). For **UpgrReq** it sends an acknowledgment (**Ack** command) to give permission.

If needed the home node may generate other commands to remote nodes:

- ► Fetch: asks remote (owner) node for a copy of line (Dreply)
- ► Invalidate: asks remote (reader) node to invalidate its copy, remote sends confirmation to home (Ack)



### Directory-based cache coherency: example

Write miss to clean line with two sharers

- ▶ Local node where the miss request originates: processor 1
- ▶ Home node where the memory line resides: processor 2
- Copies of line in caches of remote processors 2 and 3



### Snooping- and directory-based protocols together!

If nodes have snoopy-based coherence, then the hub becomes an additional agent that interacts with the home (directory) nodes for the cache lines copied in the node



#### Coherence commands

- Core: PrRd, and PrWr., being I the core number doing the action
- Snoopy: BusRdi, BusRdXi BusUpgri and Flushi, being the snoopy/cache number doing the action
- Hub/directoty: RdReq<sub>i→i</sub>, WrReq<sub>i→i</sub>, UpgrReq<sub>i→i</sub>, Dreply<sub>i→i</sub>, Fetch<sub>i→i</sub>, Invalidate<sub>i→i</sub>, Ack<sub>i→i</sub> and WriteBack, from NUMAnode i to NUMAnode i

#### Line state in cache

- M (Modified), S (Shared), I (Invalid) Line state in main memory
- M (Modified), S (Shared), U (Uncached)



Uniprocessor parallelism

Symmetric multi-processor architectures

Video lesson 4

Hardware support for coherence (I)

Who is causing coherence traffic? true and false sharing

Multicore architectures

#### Non-Uniform Memory Architectures

Video lesson 5

Hardware support for coherence (II)

#### Data sharing and initialization

- True and false sharing have now a much higher penalty
- What may be wrong with data initialization? Be aware of "first touch"

```
for (int i=0; i<128; i++) {
   a[i] = random();
   b[i] = random();
}</pre>
```

Vectors a and b are allocated in a single node of of the NUMA system, as follows

```
M<sub>0</sub>
```

```
#pragma omp parallel num_threads(4)
{
  int myid = omp_get_thread_num();
  int BS = 128 / omp_get_num_threads();
  for (int i=myid*BS; i<(myid*1)*BS; i*++) {
    a[i] = random();
    b[i] = random();
}</pre>
```

Vectors a and b are distributed across the memories of the NUMA system, as follows

| Mo  | $M_1$ | $M_2$ | $M_3$ |
|-----|-------|-------|-------|
| 031 | 3263  | 6495  | 96127 |

Ma

### Data sharing and initialization

```
#pragma omp parallel num_threads(4)
{
  int myid = omp_get_thread_num();
  int BS = 128 / omp_get_num_threads();
  for (int i=myid*BS; i<(myid*1)*BS; i++)
    b[i] = foo1(a[i]);
  for (int i=myid*BS; i<(myid*1)*BS; i++)
    a[i] = foo2(b[i]);
}</pre>
```

 $M_0$ 

|      | 0127                            |                                 |                                 |                                 |
|------|---------------------------------|---------------------------------|---------------------------------|---------------------------------|
|      | P <sub>0</sub> @ M <sub>0</sub> | P <sub>1</sub> @ M <sub>1</sub> | P <sub>2</sub> @ M <sub>2</sub> | P <sub>3</sub> @ M <sub>3</sub> |
| for1 | 031                             | 3263                            | 6495                            | 96127                           |
| IOFI | 031                             | 3263                            | 6495                            | 96127                           |
| for2 | 031                             | 3263                            | 6495                            | 96127                           |

X.y No coherence traffic

|      | 031                   | 3263                            | 6495                            | 96127                           |
|------|-----------------------|---------------------------------|---------------------------------|---------------------------------|
|      |                       |                                 |                                 |                                 |
|      | $P_0 \mathbin{@} M_0$ | P <sub>1</sub> @ M <sub>1</sub> | P <sub>2</sub> @ M <sub>2</sub> | P <sub>3</sub> @ M <sub>3</sub> |
|      |                       |                                 |                                 |                                 |
| for1 | 031                   | 3263                            | 6495                            | 96127                           |
|      |                       |                                 |                                 |                                 |
| for2 | 031                   | 3263                            | 6495                            | 96127                           |

Mэ

 $M_3$ 

x..y No coherence traffic



### Parallelism (PAR)

Unit 3: Introduction to parallel architectures (or in other words, where the data sharing overheads come from?)

Eduard Ayguadé, Daniel Jiménez and Gladys Utrera

Computer Architecture Department Universitat Politècnica de Catalunya

Course 2021/22 (Spring semester)